A. The Problem — Free Text Is Unparseable

You know how to prompt — but prompts return text

Agenda

  • A. The Problem — Why free text breaks in production
  • B. Structured Outputs — Constraint sampling to the rescue
  • C. LLMs Are Frozen — The knowledge limitation
  • D. The Naive Approach — Prompt-engineering tool calls (and why it fails)
  • E. The Trained Solution — Fine-tuned formats and the API abstraction
  • F. Schema Design — Practical rules for reliable tool use

A Classification System You Might Build

You want to classify customer feedback automatically.

response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",
    messages=[{
        "role": "user",
        "content": (
            f"Classify the sentiment of this text: '{text}'\n"
            "Return one word: positive, negative, or neutral."
        )
    }]
)
result = response.choices[0].message.content

This looks reasonable. You test it on ten examples. It works. You ship it.

Five Different Outputs, One Prompt

Same prompt. Five real model outputs. All semantically correct.

Run | Output                                     | Parseable?
1   | Positive                                   | Sort of
2   | positive                                   | Sort of
3   | The sentiment is positive.                 | No
4   | I'd classify this as positive sentiment.   | No
5   | **Positive** — the tone is clearly upbeat. | No

The Core Observation

Every output is semantically correct. None are programmatically consistent.

The Mental Model Shift

What We Need — Data

  • A typed sentiment field
  • A float confidence score
  • A str reasoning field
  • A guarantee these fields exist
  • Code that does result.sentiment

What Free Text Gives — Prose

  • A single unstructured string
  • Variable formatting
  • Optional reasoning, maybe
  • No guarantee of any structure
  • Code that does result.lower().strip()

Note

The shift from “chat” to “data” is the foundation of everything that follows.

B. Structured Outputs — Constraint Sampling

Recall what you know about sampling — then constrain it

How Constraint Sampling Works

Constraint sampling zeros out tokens that would violate the schema — before sampling.

Position | Generating…                   | Valid tokens only
1        | {                             | only { can open a JSON object
2–12     | "sentiment"                   | only valid field name in the schema
13–14    | ": "                          | structural JSON
15       | "                             | opening quote
16–23    | positive / negative / neutral | only these 3 tokens have non-zero probability
24+      | ", "confidence": 0.87, ...    | rest of schema follows

Tip

This is not post-processing. It happens at sampling time. An invalid token cannot be generated — it has zero probability before the sample is drawn.
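A toy sketch makes the mechanism concrete. This is not a real inference engine — the vocabulary, logits, and single-string "tokens" are invented for illustration — but the masking step is the idea: invalid tokens get a logit of negative infinity, so their probability is exactly zero before the draw.

```python
import math
import random

# Toy illustration of constraint sampling: mask schema-violating tokens
# *before* sampling, so they can never be drawn.
vocab   = ["positive", "negative", "neutral", "The", "Sure", "**"]
logits  = [2.1, 0.3, -0.5, 3.0, 1.2, 0.8]     # "The" is actually most likely!
allowed = {"positive", "negative", "neutral"}  # schema-valid tokens only

masked = [l if tok in allowed else -math.inf   # -inf logit → probability 0
          for tok, l in zip(vocab, logits)]

# Softmax over masked logits: invalid tokens have exactly zero probability.
exps  = [math.exp(l) for l in masked]          # exp(-inf) == 0.0
total = sum(exps)
probs = [e / total for e in exps]

token = random.choices(vocab, weights=probs)[0]
# token is guaranteed to be one of the three schema-valid strings
```

Note that "The" had the highest raw logit — an unconstrained sample would often start with prose. Masking removes that possibility structurally rather than by instruction.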

The API Call with response_format

from typing import Literal

from pydantic import BaseModel

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    reasoning: str

# .parse (not .create) validates the response and populates .parsed
response = client.chat.completions.parse(
    model="claude-haiku-4-5-20251001",
    messages=[{
        "role": "user",
        "content": f"Classify the sentiment of: '{text}'"
    }],
    response_format=SentimentResult   # ← pass the Pydantic class
)
result = response.choices[0].message.parsed   # typed SentimentResult object

Aspect      | Before                              | After
Access      | response.choices[0].message.content | response.choices[0].message.parsed
Type        | str                                 | SentimentResult
Usage       | result.lower().strip()              | result.sentiment
Reliability | Fails on edge cases                 | Guaranteed valid

Note

Structured outputs return data. Tool calling takes action. The constraint-sampling mechanism is the same — applied to different problems.

C. LLMs Are Frozen — The Knowledge Problem

Format solved. New problem: the model cannot look things up.

Three Concrete Failure Cases

Live Prices

“What is NVIDIA’s stock price?”

The model will confidently give a number — from training data. That number is stale by months or years.

Recent Events

“What happened in the news today?”

The model cannot know. It will either say “I don’t know” or hallucinate plausible-sounding events.

Private Data

“What is the status of order #9821?”

Your database was never in the training data. The model has no way to access it.

Note

These are structural limitations, not bugs. A more capable model does not fix them. Access to live data does.

The Architectural Answer

The solution is not a better model. It is a better architecture.

sequenceDiagram
    participant U as User
    participant L as LLM
    participant C as Your Code
    participant D as Live Data Source

    U->>L: "What is NVIDIA's stock price?"
    Note over L: I need live data — signal need
    L->>C: (structured signal) fetch_stock_price("NVDA")
    C->>D: API call / database query
    D-->>C: {"price": 142.50, "timestamp": "2026-02-23"}
    C->>L: tool result: $142.50
    L->>U: "NVIDIA is currently trading at $142.50."

The LLM’s job: signal the need for data. Your code: fetch it. The LLM: answer with real information.

The Remaining Question

What We Need

A reliable mechanism for the LLM to signal:

  1. Which function to call
  2. What arguments to pass
  3. In a form your code can parse

Structured. Every time. Not occasionally.

The Naive Approach (Preview)

You already know how to get structured output from a model.

What if you prompt-engineered a special action format?

System: When you need data, output:
ACTION: search
QUERY: <your query>

Let’s try it.

D. The Naive Approach — Prompt-Engineering Tool Calls

The obvious idea — and why it fails

The Obvious Idea

system_prompt = """You are a helpful assistant with access to a search tool.
When you need to look up information, output EXACTLY this format:

ACTION: search
QUERY: <your search query>

Do not add any other text before or after this format.
After I provide the result, continue the conversation normally.
"""

response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is NVIDIA's current stock price?"}
    ]
)

It Works… Sometimes

On your test set of 20 queries, you get this:

ACTION: search
QUERY: NVIDIA current stock price 2026

Clean. Parseable. Your code splits on \n, checks ACTION:, extracts QUERY:.

You ship it.

One week later, your monitoring dashboard shows a 23% parse error rate.

Four Failure Modes

Failure                 | Actual Output                                                                   | Your Parser
Prose prefix            | "Sure! ACTION: search\nQUERY: ..."                                              | Fails — leading text
Casing variation        | "action: search\nquery: ..."                                                    | Fails — lowercase keys
Different phrasing      | "SEARCH: NVIDIA stock price"                                                    | Fails — wrong key name
Format ignored entirely | "I don't have real-time access to stock prices, but as of my training data..." | Fails — no action at all
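The parser described earlier can be sketched in a few lines (a hypothetical implementation — split on newlines, check the key prefixes). Run against the failure modes above, it breaks exactly as the table predicts:

```python
# Hypothetical parser for the ACTION:/QUERY: format from the system prompt.
def parse_action(output: str):
    lines = output.strip().split("\n")
    if (len(lines) >= 2
            and lines[0].startswith("ACTION: ")
            and lines[1].startswith("QUERY: ")):
        return lines[0][len("ACTION: "):], lines[1][len("QUERY: "):]
    return None  # parse failure

# Happy path — exactly the format the prompt asked for:
assert parse_action("ACTION: search\nQUERY: NVIDIA stock price") == \
    ("search", "NVIDIA stock price")

# Each failure mode from the table returns None:
assert parse_action("Sure! ACTION: search\nQUERY: ...") is None  # prose prefix
assert parse_action("action: search\nquery: ...") is None        # casing
assert parse_action("SEARCH: NVIDIA stock price") is None        # wrong key
assert parse_action("I don't have real-time access...") is None  # ignored
```

You can make the parser more forgiving — case-insensitive keys, regex matching — but every relaxation you add is another admission that the format was never guaranteed in the first place.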

The Root Cause Is Identical — and Unfixable by Prompting

Same diagnosis as Section A:

Section A: “Return one word: positive/negative/neutral”

→ Instruction. Suggestion. Not a constraint.

Section D: “Output EXACTLY: ACTION: search: …”

→ Instruction. Suggestion. Not a constraint.

Important

More explicit wording, more few-shot examples, even retry-on-failure logic — none of these change the underlying fact: sampling is probabilistic. You cannot enforce format at the token level with text.

E. The Trained Solution — Fine-Tuned Format to API

Same idea — baked into training, abstracted by the API

The Training Insight

The Prompt Approach (Section D)

The system message tells the model to output ACTION: search.

The instruction competes with every other learned pattern in the model’s weights.

Result: inconsistent.

The Training Approach

During supervised fine-tuning (SFT), thousands of examples show the model using a specific token-level format for tool calls.

The format becomes a learned behavior, not an instruction.

Result: reliable.

Tip

The naive system prompt was the right idea — at the wrong level. The solution moves the format contract from the prompt into the model weights themselves.

A Concrete Example — Llama 3.1’s Format

<|python_tag|>{"name": "get_weather", "parameters": {"city": "Riyadh"}}<|eom_id|>

Token-by-token annotations:

Token                        | Role
<|python_tag|>               | Special token: “I am about to make a tool call”
{"name": "get_weather", ...} | JSON payload constrained by fine-tuning
<|eom_id|>                   | End-of-message: “tool call complete, stop generating”

Note

This is not magic — it is the ACTION: search pattern, baked into fine-tuning data at massive scale. The model learned to produce this format reliably because it saw it thousands of times during training.

Every Major Model Has This Format

Model     | Trigger Mechanism             | Notes
Llama 3.1 | <|python_tag|> special token  | Open weights — format is public
GPT-4o    | Internal routing tokens       | Abstracted behind the API
Claude    | Internal XML-like tags        | <function_calls> format in weights
Mistral   | [TOOL_CALLS] token            | Similar to Llama approach
Gemini    | Internal function call tokens | Abstracted behind the API

Tip

You don’t need to know the internal format of any model. The OpenAI-compatible API abstracts all of it into a single unified interface.

The API Abstraction

Without the API

# Detect Llama's format manually
if "<|python_tag|>" in content:
    json_str = content.split(
        "<|python_tag|>"
    )[1].split("<|eom_id|>")[0]
    call = json.loads(json_str)
    # Now handle Claude's format...
    # Now handle GPT's format...
    # Now handle Mistral's format...

One parser per model. Brittle. Unscalable.

With the API

# Same code for every model
tool_call = (
    response.choices[0]
    .message
    .tool_calls[0]
)
name = tool_call.function.name
args = json.loads(
    tool_call.function.arguments
)

One interface. Works with every model.

Step 1 — Define the Tool (JSON Schema)

calculator_schema = {
    "type": "function",
    "function": {
        "name": "execute_calculation",
        "description": (
            "Executes arithmetic operations. Use when the user asks to "
            "calculate percentages, rates, compound growth, or any "
            "arithmetic. Example: '15% of 500' → multiply(500, 0.15)"
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide", "pow"],
                    "description": "The arithmetic operation to perform"
                },
                "operand_a": {"type": "number", "description": "First operand"},
                "operand_b": {"type": "number", "description": "Second operand"}
            },
            "required": ["operation", "operand_a", "operand_b"]
        }
    }
}

Note

description is a prompt to the model — it controls when this tool is selected. The enum on operation applies the same constraint-sampling mechanism as Section B.

Step 2 — API Call 1: Send Tool Definitions

response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",
    messages=messages,              # conversation so far
    tools=[calculator_schema],      # ← inject tool definitions
    tool_choice="auto",             # ← model decides whether to call
    temperature=0.1                 # ← low temp: more deterministic
)
  • tools — list of JSON Schema definitions; the model reads these as context
  • tool_choice="auto" — model can call a tool or respond directly
  • tool_choice="required" — forces a tool call (useful for testing)
  • temperature=0.1 — allows slight variation in reasoning, not in schema

Step 3 — Parse the Response

The API response when the model wants to call a tool:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "execute_calculation",
          "arguments": "{\"operation\": \"multiply\", \"operand_a\": 500, \"operand_b\": 0.15}"
        }
      }]
    }
  }]
}
tool_call = response.choices[0].message.tool_calls[0]
name     = tool_call.function.name                       # "execute_calculation"
args     = json.loads(tool_call.function.arguments)      # dict
tool_id  = tool_call.id                                  # "call_abc123" — keep this

Tip

Always wrap json.loads() in a try/except — rare edge cases can produce malformed JSON.

Step 4 — Execute and Feed Back

# 1. Execute your local function
result = execute_calculation(**args)   # {"success": True, "result": 75.0}

# 2. Append the assistant's tool_call message first
messages.append(response.choices[0].message)

# 3. Append the tool result with the matching ID
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,     # ← must match tool_calls[0].id exactly
    "content": json.dumps(result)
})

Important

tool_call_id is the linking key. The model uses it to match its request to your result. If the IDs don’t match, the model cannot construct its final answer correctly.

Step 5 — API Call 2: Final Answer

# Second call — model sees the tool result and responds to the user
final_response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",
    messages=messages   # contains: user + tool_call + tool_result
)
print(final_response.choices[0].message.content)
# → "15% of 500 is 75."

The Complete Picture

graph TD
    A["User Message"] --> B["API Call 1: messages + tools"]
    B --> C{"tool_calls?"}
    C -->|No| D["Return text response"]
    C -->|Yes| E["Parse tool_calls (name + args + id)"]
    E --> F["Execute local function"]
    F --> G["Append tool result with tool_call_id"]
    G --> H["API Call 2: updated messages"]
    H --> I["Return final response"]

You now understand why every node exists — not just what to type.

F. Schema Design & Wrap-up

Practical rules for making it all work reliably

The Description Field Is a Prompt

Weak description

"description": "Does math"

The model might call this for plain arithmetic — or might not. It will likely miss “what is 15% of 500?” because the description gives no hints about use cases.

Strong description

"description": "Executes arithmetic
operations. Use when the user asks
to calculate percentages, rates,
compound growth, or any arithmetic.
Example: '15% of 500' 
multiply(500, 0.15)"

Purpose + examples + edge cases = reliable triggering.

Tip

Complete the sentence: “Use this tool when the user wants to…” — if your description doesn’t finish that sentence clearly, the model will miss calls.

Enums Are the Single Biggest Win

graph LR
    A["User: 'add 5 and 3'"] --> B{"Schema Type?"}
    B -->|"String (free text)"| C["Ambiguous output: 'plus', 'Add', 'addition'"]
    B -->|"Enum (constrained)"| D["Precise output: 'add'"]

Note

Same constraint-sampling mechanism as Section B’s Literal["positive","negative","neutral"] — now applied to tool parameters. One pattern, two applications.

Structuring Nested Parameters

"parameters": {
    "type": "object",
    "properties": {
        "location": {
            "type": "object",
            "properties": {
                "city":    {"type": "string", "description": "City name in English"},
                "country": {"type": "string", "enum": ["SA", "US", "UK", "AE"]}
            },
            "required": ["city"]          # ← required at the nested level too
        },
        "price_range": {
            "type": "string",
            "enum": ["budget", "mid", "luxury"]
        }
    },
    "required": ["location"]              # ← required at the top level
}

Apply required at every level of nesting. A missing required field causes a silent partial call that your code cannot handle cleanly.
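A small helper can catch this class of bug before execution. The sketch below is illustrative (not part of any SDK, and far less complete than a real JSON Schema validator such as the `jsonschema` package) — it walks the schema and reports every missing required field, at every nesting level:

```python
def check_required(schema: dict, args: dict, path: str = "") -> list[str]:
    """Recursively report missing required fields at every nesting level.
    Illustrative helper only; a real validator checks types and enums too."""
    missing = []
    for field in schema.get("required", []):
        if field not in args:
            missing.append(f"{path}{field}")
    for field, sub in schema.get("properties", {}).items():
        if sub.get("type") == "object" and field in args:
            missing += check_required(sub, args[field], f"{path}{field}.")
    return missing

schema = {
    "type": "object",
    "properties": {
        "location": {
            "type": "object",
            "properties": {"city": {"type": "string"},
                           "country": {"type": "string"}},
            "required": ["city"],
        },
        "price_range": {"type": "string"},
    },
    "required": ["location"],
}

assert check_required(schema, {"location": {"country": "SA"}}) == ["location.city"]
assert check_required(schema, {}) == ["location"]
assert check_required(schema, {"location": {"city": "Riyadh"}}) == []
```

Turning a silent partial call into an explicit error message lets you feed the problem back to the model as a structured tool result instead of crashing mid-conversation.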

Structured Error Returns

The tool always returns a dict. It never raises an uncaught exception.

def execute_calculation(operation, operand_a, operand_b):
    # One entry per value in the schema's "operation" enum
    ops = {
        "add":      lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide":   lambda a, b: a / b,
        "pow":      lambda a, b: a ** b,
    }
    try:
        if operation == "divide" and operand_b == 0:
            return {"success": False,
                    "result": None,
                    "error": "Division by zero is not allowed."}
        answer = ops[operation](operand_a, operand_b)
        return {"success": True, "result": answer, "error": None}
    except Exception as e:
        return {"success": False, "result": None,
                "error": f"Calculation error: {str(e)}"}

The LLM receives the structured error as a tool message and explains it naturally to the user — no stack traces exposed, conversation continues.

Schema Design Checklist

Principle          | What to Do
Verb names         | get_weather, execute_calculation, search_hotels
Rich descriptions  | Purpose + examples + edge cases
Enums over strings | Constrain values wherever a fixed set exists
Explicit required  | List every mandatory field at every nesting level
Nested objects     | Group related params into sub-objects, not flat strings
Structured errors  | {success, result, error} — always a dict, never an exception
Low temperature    | temperature=0.1 for tool-calling calls

The Decision Tree

graph LR
    Start{"What is your goal?"}
    Start -->|"Return data"| SO["Structured Outputs"]
    Start -->|"Take action"| TC["Tool Calling"]
    Start -->|"Talk to human"| Chat["Standard Chat"]
    SO -->|"Example"| SEx["Extract entities<br/>Classify text<br/>Format reports"]
    TC -->|"Example"| TEx["Search database<br/>Call API<br/>Run calculation"]
    Chat -->|"Example"| CEx["Explain concept<br/>Write poem<br/>Brainstorm ideas"]

    style SO fill:#9B8EC0,stroke:#1C355E,color:white
    style TC fill:#00C9A7,stroke:#1C355E,color:#1C355E
    style Chat fill:#FF7A5C,stroke:#1C355E,color:white

Key Takeaways

  1. Free text is a format contract problem — instructions are suggestions; constraints are structural.
  2. Structured Outputs enforce contracts at sampling time — the model cannot produce an invalid token.
  3. LLMs are frozen snapshots — live data, recent events, and private data require an architectural solution.
  4. Prompt-engineered tool formats are unreliable — same root cause as unstructured text output.
  5. Tool calling works because models are trained on a specific format — the API abstracts this format away.
  6. The two-call loop is the core protocol — API Call 1 gets the tool request; your code executes; API Call 2 gets the answer.
  7. Schema quality determines tool reliability — descriptions, enums, required fields, and structured errors are not optional.

G. Architecture Preview

Where we’re heading in Lab 3

The If-Else Trap

As your tool count grows, manual dispatch becomes unmanageable:

def run_agent(tool, args):
    if tool == "calc":
        return run_calc(**args)
    elif tool == "email":
        return run_email(**args)
    elif tool == "files":
        return run_files(**args)
    # ... 50 more tools?

Important

Every new tool requires modifying the dispatcher. No security, no rate limiting, no consistency.

The Plugin Contract

Every tool implements the same interface — an Abstract Base Class:

class BaseTool(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def description(self) -> str: ...

    @property
    @abstractmethod
    def parameters(self) -> dict: ...

    @abstractmethod
    def execute(self, **kwargs) -> dict: ...

Each tool encapsulates schema + logic + permissions in one cohesive unit.
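A concrete tool fulfilling the contract might look like the sketch below. The class name and the trimmed-down schema are illustrative (the full calculator schema appears in Section E); the ABC is restated so the example is self-contained.

```python
from abc import ABC, abstractmethod

class BaseTool(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...
    @property
    @abstractmethod
    def description(self) -> str: ...
    @property
    @abstractmethod
    def parameters(self) -> dict: ...
    @abstractmethod
    def execute(self, **kwargs) -> dict: ...

class CalculatorTool(BaseTool):
    """Illustrative tool: schema and logic live together in one class."""
    @property
    def name(self) -> str:
        return "execute_calculation"

    @property
    def description(self) -> str:
        return "Executes arithmetic operations such as '15% of 500'."

    @property
    def parameters(self) -> dict:
        return {
            "type": "object",
            "properties": {
                "operation": {"type": "string", "enum": ["add", "multiply"]},
                "operand_a": {"type": "number"},
                "operand_b": {"type": "number"},
            },
            "required": ["operation", "operand_a", "operand_b"],
        }

    def execute(self, **kwargs) -> dict:
        ops = {"add": lambda a, b: a + b, "multiply": lambda a, b: a * b}
        try:
            answer = ops[kwargs["operation"]](kwargs["operand_a"],
                                              kwargs["operand_b"])
            return {"success": True, "result": answer, "error": None}
        except Exception as e:
            return {"success": False, "result": None, "error": str(e)}
```

Because the schema is a property of the tool itself, the registry can assemble the `tools=[...]` list for the API call without knowing anything about arithmetic.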

The Registry Pattern

Dynamic dispatch — agent code never changes when you add a tool:

class ToolRegistry:
    def register(self, tool: BaseTool): ...
    def get_schemas(self) -> list: ...
    def execute(self, name, args) -> dict: ...

# Usage
registry.register(CalculatorTool())
registry.register(ListFilesTool())
result = registry.execute(tool_name, args)

Goal: Add tools by registering them, not by editing dispatchers.
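A minimal registry implementing that interface could be sketched as follows — a dictionary keyed by tool name, nothing more. A production version (the Lab 3 build) would add permissions, rate limiting, and logging around `execute`.

```python
class ToolRegistry:
    """Minimal sketch: dispatch by name, no if/else chains."""
    def __init__(self):
        self._tools = {}

    def register(self, tool):
        # Any object with name/description/parameters/execute qualifies
        self._tools[tool.name] = tool

    def get_schemas(self) -> list:
        # OpenAI-style tool definitions, built from each tool's own schema
        return [{"type": "function",
                 "function": {"name": t.name,
                              "description": t.description,
                              "parameters": t.parameters}}
                for t in self._tools.values()]

    def execute(self, name: str, args: dict) -> dict:
        if name not in self._tools:
            # Structured error, never an exception — same contract as tools
            return {"success": False, "result": None,
                    "error": f"Unknown tool: {name}"}
        return self._tools[name].execute(**args)
```

Note that `get_schemas()` is what feeds API Call 1's `tools=` parameter, and `execute()` is what API Call 1's parsed `tool_calls` are dispatched through — adding a tool touches neither call site.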

Defense in Depth

Multiple security layers protect against edge cases:

graph LR
    A["Input"] --> B["Schema Validation"]
    B --> C["Permission Check"]
    C --> D["Rate Limit"]
    D --> E["Sanitization"]
    E --> F["Execute"]

Lab Preview

Up Next

Lab 1 — Schema Gym (no API key required)

  • Part 1 (30 min): Design and validate tool schemas with Pydantic
  • Part 2 (40 min): Implement the complete two-call tool-calling loop

Lab 2 — Calculator Tool

Build a production-quality tool-calling agent with full error handling, structured outputs, and observability.

Lab 3 — Plugin Framework (Session 2)

Build the full architecture: BaseTool, ToolRegistry, RateLimiter, and PathSanitizer. Pure Python OOP — no API keys needed.